An exhaustive list of things you can do for data preparation and feature engineering can be found here: http://www.datasciencecentral.com/profiles/blogs/feature-engineering-data-scientist-s-secret-sauce-1
Unfortunately, it is just a list.
A fairly nice overview of the steps and methods is here: http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
I will pick a few of the most commonly used techniques and show what they are good for.
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
plt.rcParams['figure.figsize'] = 9, 6
from IPython.display import Image
An example on the data about the sinking of the Titanic: https://www.kaggle.com/c/titanic
In [2]:
titanic = pd.read_csv('titanic/train.csv')
titanic.head()
Out[2]:
In [3]:
import re
titanic.Name.apply(lambda x: re.split('[,.]', x)).head(10)
Out[3]:
In [4]:
titanic.Name.apply(lambda x: re.split(r'\s*[,.]\s*', x)).head(10)
Out[4]:
In [5]:
set(titanic.Name.apply(lambda x: re.split(r'\s*[,.]\s*', x)[1]))
Out[5]:
In [6]:
titanic['title'] = titanic.Name.apply(lambda x: re.split(r'\s*[,.]\s*', x)[1])
titanic.title.head(10)
Out[6]:
In [7]:
titanic.title.value_counts()
Out[7]:
In [8]:
titanic.loc[titanic.title == 'Mlle', 'title'] = 'Miss'
titanic.loc[titanic.title == 'Mme', 'title'] = 'Mrs'
titanic.loc[titanic.title.isin(['Capt', 'Don', 'Major']), 'title'] = 'Sir'
titanic.loc[titanic.title.isin(['Dona', 'Lady', 'the Countess', 'Jonkheer']), 'title'] = 'Lady'
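The same consolidation can be written more compactly with a mapping dict and Series.replace; a sketch equivalent to the .loc assignments above:
title_map = {'Mlle': 'Miss', 'Mme': 'Mrs',
             'Capt': 'Sir', 'Don': 'Sir', 'Major': 'Sir',
             'Dona': 'Lady', 'the Countess': 'Lady', 'Jonkheer': 'Lady'}
titanic['title'] = titanic.title.replace(title_map)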
In [9]:
titanic.groupby('title').size().plot(kind='bar')
Out[9]:
For example, when predicting an airplane's landing time, it is usually not a good idea to predict the time itself; it is better to predict the flight duration and then simply add it to the departure time.
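A minimal sketch of that idea with hypothetical columns (scheduled_departure, actual_arrival and a flights frame are not from this notebook): the model target is the duration, and the landing time is reconstructed afterwards.
# hypothetical datetime columns: scheduled_departure, actual_arrival
flights['duration_min'] = (flights.actual_arrival - flights.scheduled_departure).dt.total_seconds() / 60
# ... train a regressor with duration_min as the target ...
# reconstruct the landing time from the predicted duration
predicted_arrival = flights.scheduled_departure + pd.to_timedelta(predicted_duration_min, unit='m')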
In [10]:
url = 'http://www.datasciencecentral.com/profiles/blogs/predicting-flights-delay-using-supervised-learning'
from IPython.display import IFrame
IFrame(url, width=700, height=350)
Out[10]:
In [11]:
# https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+#
# classification of whether a room is occupied or not, based on sensor readings
occupancy = pd.read_csv('occupancy/datatraining.txt', sep=',')
date = pd.to_datetime(occupancy.date, format='%Y-%m-%d %H:%M:%S')
occupancy.date = date
occupancy.head()
Out[11]:
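The same parsing can also be done directly in read_csv via parse_dates; a sketch that should behave equivalently for this file:
occupancy = pd.read_csv('occupancy/datatraining.txt', sep=',', parse_dates=['date'])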
In [12]:
occupancy['weekday'] = occupancy.date.dt.weekday
occupancy.head()
Out[12]:
In [13]:
occupancy['weekend'] = False
occupancy.loc[occupancy.weekday.isin([5, 6]), 'weekend'] = True
occupancy.tail()
Out[13]:
In [14]:
occupancy.weekend.value_counts()
Out[14]:
Useful, for example, when analysing how people behave on the web: right after lunch people probably read different articles than during the more productive part of the day.
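A sketch of what that could look like on the occupancy data: extract the hour from the timestamp and bin it into rough parts of the day (the bin edges are just an illustrative choice):
occupancy['hour'] = occupancy.date.dt.hour
occupancy['part_of_day'] = pd.cut(occupancy.hour,
                                  bins=[0, 6, 12, 18, 24],
                                  labels=['night', 'morning', 'afternoon', 'evening'],
                                  right=False)
occupancy[['date', 'hour', 'part_of_day']].head()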
In [15]:
# various attributes that can be extracted from a timestamp
url = 'http://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties'
IFrame(url, width=1000, height=350)
Out[15]:
In [16]:
trans = pd.read_csv('berka/trans.asc', sep=';', nrows=1000)
trans.head()
Out[16]:
In [17]:
date = pd.to_datetime(trans.date, format='%y%m%d')
trans.date = date
trans.head()
Out[17]:
In [18]:
means = trans.groupby('operation').amount.mean()
means
Out[18]:
In [19]:
trans['mean_amount_per_operation'] = trans.apply(lambda x: 0 if pd.isnull(x['operation']) else means[x['operation']], axis=1)
trans.tail()
Out[19]:
In [20]:
means_df = pd.DataFrame(means)
means_df.columns = ['mean_amount_per_operation']
means_df['operation'] = means_df.index
means_df.index = range(len(means_df))
means_df
Out[20]:
In [21]:
pd.merge(trans, means_df, how='left', on='operation')
Out[21]:
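The same derived attribute can also be computed without apply or merge, using groupby().transform; a sketch that matches the versions above, including the 0 for missing operations:
trans['mean_amount_per_operation'] = (trans.groupby('operation')['amount']
                                           .transform('mean')
                                           .fillna(0))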
In [22]:
# data about network communication between computers, from www.neteye-blog.com/netcla-the-ecml-pkdd-network-classification-challenge/
# the task is to classify which application generated the traffic
data_file = "../vos/challenge/NetCla/data/train.csv"
netcla = pd.read_csv(data_file, nrows=1000, sep='\t')
netcla.head()
Out[22]:
In [23]:
netcla.info()
In [24]:
# seaborn.pairplot(netcla)  # not a good idea here: there are far too many attributes, and an all-pairs matrix would be huge
plt.rcParams['figure.figsize'] = 18, 12
netcla.hist()
plt.rcParams['figure.figsize'] = 9, 6
In [25]:
pom = netcla.throughput.hist(bins=50)
pom.set_title('throughput')
Out[25]:
Various algorithms can have trouble with a distribution like this.
Logistic regression, neural networks, or anything else that puts weights on attributes:
what weight do you give an attribute whose range is 0 to 10^6 and whose values are mostly crowded on one side?
The large values will dominate the whole computation and you will not be able to distinguish the small ones.
In [26]:
from scipy.stats import boxcox
Instead of Box-Cox we could simply take the logarithm of the original value, but that is a hack. Box-Cox tries to make the resulting distribution resemble a normal one.
In [27]:
# boxcox returns the transformed data and the transformation parameter. If I fix the parameter, it returns only the transformed data
transformed, att = boxcox(netcla.throughput + 1)  # we cannot transform 0 or negative values, hence the + 1
pom = pd.Series(transformed).hist(bins=50)
pom.set_title('throughput (box-cox)')
Out[27]:
This already looks a bit more like a normal distribution, but it is not normalized.
In [28]:
def normalization(data, shift, scale):
    return (np.array(data) - float(shift)) / scale
z-normalization: shift = mean, scale = std
0-1 normalization: shift = min, scale = max - min
quartiles can be used to reduce the influence of outliers
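A sketch of the quartile-based variant using the normalization function above: shift = median, scale = interquartile range, which is much less sensitive to outliers (scikit-learn's RobustScaler does essentially the same thing):
q1, q3 = np.percentile(transformed, [25, 75])
robust_transformed = normalization(transformed, np.median(transformed), q3 - q1)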
In [29]:
z_transformed = normalization(transformed, np.mean(transformed), np.std(transformed))
pom = pd.Series(z_transformed).hist(bins=50)
pom.set_title('throughput (Z-normalization)')
Out[29]:
PCA tries to explain the variance in the data with as few components as possible. It effectively weights the dataset's attributes by how much variance they contain and uses them with that weight. If you have an unnormalized attribute whose variance is much larger than the others', it will dominate the resulting representation at the expense of the rest. That is not always what you want.
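A minimal sketch of that effect (illustrative, not part of the original analysis): run PCA on the numeric netcla columns with and without standardization and compare how much variance the first components claim:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
numeric = netcla._get_numeric_data().dropna()
print(PCA(n_components=2).fit(numeric).explained_variance_ratio_)                                # dominated by the highest-variance attribute
print(PCA(n_components=2).fit(StandardScaler().fit_transform(numeric)).explained_variance_ratio_)  # after standardization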
Sometimes an attribute on its own does not help you much, but combined with another one it can.
For example, in a dataset of cars you have fuel consumption and tank size. These attributes are fine on their own, but if you combine them you get the driving range, which can also be very important when choosing a car.
This usually requires domain knowledge, but the process can be automated to some extent, as shown below.
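The car example as a two-line sketch, with hypothetical columns and assuming consumption is expressed in km per litre (so that multiplying the two makes sense):
# hypothetical columns: consumption_km_per_l, tank_size_l
cars['range_km'] = cars.consumption_km_per_l * cars.tank_size_l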
In [30]:
X = np.arange(6).reshape(3, 2)
X
Out[30]:
In [31]:
from sklearn import preprocessing
poly = preprocessing.PolynomialFeatures(3)
poly.fit_transform(X)  # create the polynomial combinations
Out[31]:
1, a, b, a^2, a*b, b^2, a^3, a^2*b, a*b^2, b^3
In [32]:
titanic = pd.read_csv('titanic/train.csv')
titanic.head()
Out[32]:
We cannot use missing values when computing polynomial features.
We split the data into independent variables (X) and the dependent variable (y).
In [33]:
titanic_X = titanic.dropna().reindex(columns=[x for x in titanic.columns.values if x != 'Survived']).reset_index(drop=True)
titanic_y = titanic.dropna().reindex(columns=['Survived']).reset_index(drop=True)
In [35]:
titanic_X
Out[35]:
In [36]:
poly = preprocessing.PolynomialFeatures(2)
# beware of too high a degree: it creates a lot of attributes and you risk not having enough data to train on
polynomial_titanic = poly.fit_transform(titanic_X)
In [37]:
polynomial_titanic = poly.fit_transform(titanic_X._get_numeric_data())
In [38]:
polynomial = pd.DataFrame(polynomial_titanic)
polynomial.head()
Out[38]:
In [39]:
titanic_X._get_numeric_data().shape
Out[39]:
If you want to keep sensible column names: http://stackoverflow.com/questions/36728287/sklearn-preprocessing-polynomialfeatures-how-to-keep-column-names-headers-of
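A sketch of one way to do it, using PolynomialFeatures' own feature-name helper (get_feature_names_out in recent scikit-learn; older versions use get_feature_names):
numeric_X = titanic_X._get_numeric_data()
poly = preprocessing.PolynomialFeatures(2)
named_polynomial = pd.DataFrame(poly.fit_transform(numeric_X),
                                columns=poly.get_feature_names_out(numeric_X.columns))
named_polynomial.head()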
In [40]:
# take only the numeric data so that the comparison is fair
original = titanic.dropna()._get_numeric_data()
original.head()
Out[40]:
In [41]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=5)
cross_validation_results = cross_val_score(clf,
                                           original[original.columns[original.columns != 'Survived']],
                                           original['Survived'], cv=6)
(cross_validation_results.mean(), cross_validation_results.std())
Out[41]:
In [42]:
cross_validation_results
Out[42]:
In [43]:
# attach the class labels back
polynomial['Survived'] = titanic_y
polynomial.head()
Out[43]:
In [44]:
from sklearn.model_selection import cross_val_score
# clf = LogisticRegression()
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=5)
cross_validation_results = cross_val_score(clf,
                                           polynomial[polynomial.columns[polynomial.columns != 'Survived']],
                                           polynomial['Survived'], cv=6)
(cross_validation_results.mean(), cross_validation_results.std())
Out[44]:
In [45]:
cross_validation_results
Out[45]:
In [46]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)
Out[46]:
Be careful not to let data from the future leak into your training.
For example, when normalizing by a mean, compute that mean only over the training data, not over everything. So-called Pipelines help a lot with this: http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
When normalizing data, use the coefficients computed on the training set to normalize the test set.
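A sketch of that with a scikit-learn Pipeline, reusing the numeric Titanic data from above: inside each cross-validation split the scaler is fit only on the training fold, so the test fold never influences the normalization coefficients.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipeline = Pipeline([
    ('scaler', StandardScaler()),   # fit only on the training fold of each split
    ('clf', LogisticRegression()),  # trained on the scaled training fold
])
scores = cross_val_score(pipeline,
                         original[original.columns[original.columns != 'Survived']],
                         original['Survived'], cv=6)
print(scores.mean(), scores.std())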